Proper noun recognition in end-to-end speech recognition

ABSTRACT

A method for training a speech recognition model with a minimum word error rate loss function includes receiving a training example comprising a proper noun and generating a plurality of hypotheses corresponding to the training example. Each hypothesis of the plurality of hypotheses represents the proper noun and includes a corresponding probability that indicates a likelihood that the hypothesis represents the proper noun. The method also includes determining that the corresponding probability associated with one of the plurality of hypotheses satisfies a penalty criteria. The penalty criteria indicates that the corresponding probability satisfies a probability threshold and that the associated hypothesis incorrectly represents the proper noun. The method also includes applying a penalty to the minimum word error rate loss function.

CROSS REFERENCE TO RELATED APPLICATIONS

This U.S. patent application is a continuation of, and claims priority under 35 U.S.C. § 120 from, U.S. patent application Ser. No. 17/150,491, filed on Jan. 15, 2021, which claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application 62/966,823, filed on Jan. 28, 2020. The disclosures of these prior applications are considered part of the disclosure of this application and are hereby incorporated by reference in their entireties.

TECHNICAL FIELD

This disclosure relates to proper noun recognition in end-to-end speech recognition.

BACKGROUND

Modern automated speech recognition (ASR) systems focus on providing not only high quality (e.g., a low word error rate (WER)), but also low latency (e.g., a short delay between the user speaking and a transcription appearing). Moreover, when using an ASR system today there is a demand that the ASR system decode utterances in a streaming fashion that corresponds to real-time or even faster than real-time. To illustrate, when an ASR system is deployed on a mobile phone that experiences direct user interactivity, an application on the mobile phone using the ASR system may require the speech recognition to be streaming such that words appear on the screen as soon as they are spoken. Here, it is also likely that the user of the mobile phone has a low tolerance for latency. Due to this low tolerance, the speech recognition strives to run on the mobile device in a manner that minimizes an impact from latency and inaccuracy that may detrimentally affect the user's experience.

SUMMARY

One aspect of the present disclosure provides a computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations that include training a speech recognition model with a minimum word error rate loss function by: receiving a training example including a proper noun; generating a plurality of hypotheses corresponding to the training example, each hypothesis of the plurality of hypotheses representing the proper noun and comprising a corresponding probability that indicates a likelihood that the hypothesis represents the proper noun; determining that the corresponding probability associated with one of the plurality of hypotheses satisfies a penalty criteria; and applying a penalty to the minimum word error rate loss function. The penalty criteria indicates that the corresponding probability satisfies a probability threshold, and the associated hypothesis incorrectly represents the proper noun.

Implementations of the disclosure may include one or more of the following optional features. In some implementations, the speech recognition model includes a two-pass architecture including: a first pass network including a recurrent neural network transducer (RNN-T) decoder; and a second pass network including a listen-attend-spell (LAS) decoder. In these implementations, the speech recognition model may further include a shared encoder that encodes acoustic frames for each of the first pass network and the second pass network. The training with the minimum word error rate loss function in these implementations may occur at the LAS decoder. The operations may further include training the RNN-T decoder, and prior to training the LAS decoder with the minimum word error rate loss function, training the LAS decoder while parameters of the trained RNN-T decoder remain fixed.

In some examples, the corresponding probability satisfies the probability threshold when the corresponding probability is greater than the corresponding probabilities associated with the other hypotheses. The operations may further include assigning the probability to each hypothesis of the plurality of hypotheses. In some implementations, the operations further include receiving an incorrect hypothesis and assigning a respective probability to the incorrect hypothesis, wherein the penalty criteria further includes an indication that the hypothesis includes the generated incorrect hypothesis. In these examples, the incorrect hypothesis may include a phonetic similarity to the proper noun and/or the operations may further include substituting the incorrect hypothesis for a generated hypothesis of the plurality of hypotheses.

Another aspect of the disclosure provides a system that includes data processing hardware and memory hardware in communication with the data processing hardware and storing instructions that when executed by the data processing hardware cause the data processing hardware to perform operations that include training a speech recognition model with a minimum word error rate loss function by: receiving a training example including a proper noun; generating a plurality of hypotheses corresponding to the training example, each hypothesis of the plurality of hypotheses representing the proper noun and comprising a corresponding probability that indicates a likelihood that the hypothesis represents the proper noun; determining that the corresponding probability associated with one of the plurality of hypotheses satisfies a penalty criteria; and applying a penalty to the minimum word error rate loss function. The penalty criteria indicates that the corresponding probability satisfies a probability threshold, and the associated hypothesis incorrectly represents the proper noun.

This aspect may include one or more of the following optional features. In some implementations, the system further includes a first pass network comprising a recurrent neural network transducer (RNN-T) decoder, and a second pass network comprising a listen-attend-spell (LAS) decoder, wherein the speech recognition model comprises the first pass network and the second pass network. In these implementations, the system may also include a shared encoder configured to encode acoustic frames for each of the first pass network and the second pass network. Training with the minimum word error rate loss function in these implementations may occur at the LAS decoder. The operations may further include training the RNN-T decoder, and prior to training the LAS decoder with the minimum word error rate loss function, training the LAS decoder while parameters of the trained RNN-T decoder remain fixed.

In some examples, the corresponding probability satisfies the probability threshold when the corresponding probability is greater than the corresponding probabilities associated with the other hypotheses. The operations may further include assigning the probability to each hypothesis of the plurality of hypotheses. In some implementations, the operations further include receiving an incorrect hypothesis and assigning a respective probability to the incorrect hypothesis, wherein the penalty criteria further includes an indication that the hypothesis includes the generated incorrect hypothesis. In these examples, the incorrect hypothesis may include a phonetic similarity to the proper noun and/or the operations may further include substituting the incorrect hypothesis for a generated hypothesis of the plurality of hypotheses.

The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF DRAWINGS

FIGS. 1A and 1B are schematic views of example speech environments using a two-pass speech recognition architecture with a joint acoustic and text model.

FIG. 2 is a schematic view of an example two-pass speech recognition architecture for speech recognition.

FIGS. 3A-3C are schematic views of example training procedures for training the two-pass speech recognition architecture of FIG. 2.

FIG. 4 is a flowchart of an example arrangement of operations for a method of training the two-pass speech recognition architecture of FIG. 2.

FIG. 5 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.

Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION

Speech recognition continues to evolve to meet the untethered and nimble demands of a mobile environment. New speech recognition architectures or improvements to existing architectures continue to be developed that seek to increase the quality of automatic speech recognition (ASR) systems. To illustrate, speech recognition initially employed multiple models where each model had a dedicated purpose. For instance, an ASR system included an acoustic model (AM), a pronunciation model (PM), and a language model (LM). The acoustic model mapped segments of audio (i.e., frames of audio) to phonemes. The pronunciation model connected these phonemes together to form words while the language model was used to express the likelihood of given phrases (i.e., the probability of a sequence of words). Yet although these individual models worked together, each model was trained independently and often manually designed on different datasets.

The approach of separate models enabled a speech recognition system to be fairly accurate, especially when the training corpus (i.e., body of training data) for a given model caters to the effectiveness of the model, but needing to independently train separate models introduced its own complexities and led to an architecture with integrated models. These integrated models sought to use a single neural network to directly map an audio waveform (i.e., input sequence) to an output sentence (i.e., output sequence). This resulted in a sequence-to-sequence approach, which generated a sequence of words (or graphemes) when given a sequence of audio features. Examples of sequence-to-sequence models include "attention-based" models and "listen-attend-spell" (LAS) models. A LAS model transcribes speech utterances into characters using a listener component, an attender component, and a speller component. Here, the listener is a recurrent neural network (RNN) encoder that receives an audio input (e.g., a time-frequency representation of speech input) and maps the audio input to a higher-level feature representation. The attender attends to the higher-level features to learn an alignment between input features and predicted subword units (e.g., a grapheme or a wordpiece). The speller is an attention-based RNN decoder that generates character sequences from the input by producing a probability distribution over a set of hypothesized words. With an integrated structure, all components of a model may be trained jointly as a single end-to-end (E2E) neural network. Here, an E2E model refers to a model whose architecture is constructed entirely of a neural network. A fully neural network functions without external and/or manually designed components (e.g., finite state transducers, a lexicon, or text normalization modules). Additionally, when training E2E models, these models generally do not require bootstrapping from decision trees or time alignments from a separate system.
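The listener/attender/speller structure described above can be illustrated with a compact sketch. The following PyTorch rendering is illustrative only, with arbitrary layer sizes, and is not the model described in this disclosure; it simply wires the three components together.

```python
# A minimal, illustrative sketch of the listener/attender/speller structure
# of a LAS model (PyTorch). Layer sizes here are arbitrary assumptions.
import torch
import torch.nn as nn

class LAS(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=100):
        super().__init__()
        self.listener = nn.LSTM(feat_dim, hidden, batch_first=True)   # RNN encoder
        self.attender = nn.MultiheadAttention(hidden, num_heads=1,
                                              batch_first=True)
        self.speller = nn.LSTMCell(hidden + hidden, hidden)           # RNN decoder
        self.embed = nn.Embedding(vocab, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats, prev_tokens):
        h, _ = self.listener(feats)                  # higher-level features
        hx = cx = feats.new_zeros(feats.size(0), self.out.in_features)
        logits = []
        for t in range(prev_tokens.size(1)):         # teacher forcing
            query = hx.unsqueeze(1)
            context, _ = self.attender(query, h, h)  # align input and output
            step = torch.cat([self.embed(prev_tokens[:, t]),
                              context.squeeze(1)], dim=-1)
            hx, cx = self.speller(step, (hx, cx))
            logits.append(self.out(hx))              # distribution over subword units
        return torch.stack(logits, dim=1)            # (B, U, vocab)
```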

Although early E2E models proved accurate and a training improvement over individually trained models, these E2E models, such as the LAS model, functioned by reviewing an entire input sequence before generating output text, and thus did not allow streaming outputs as inputs were received. Without streaming capabilities, a LAS model is unable to perform real-time voice transcription. Due to this deficiency, deploying the LAS model for speech applications that are latency sensitive and/or require real-time voice transcription may pose issues. This makes a LAS model alone not an ideal model for mobile technology (e.g., mobile phones) that often relies on real-time applications (e.g., real-time communication applications).

Additionally, speech recognition systems that have acoustic, pronunciation, and language models, or such models composed together, may rely on a decoder that has to search a relatively large search graph associated with these models. With a large search graph, it is not practical to host this type of speech recognition system entirely on-device. Here, when a speech recognition system is hosted "on-device," a device that receives the audio input uses its processor(s) to execute the functionality of the speech recognition system. For instance, when a speech recognition system is hosted entirely on-device, the processors of the device do not need to coordinate with any off-device computing resources to perform the functionality of the speech recognition system. A device that performs speech recognition not entirely on-device relies on remote computing (e.g., of a remote computing system or cloud computing) and therefore online connectivity to perform at least some function of the speech recognition system. For example, a speech recognition system performs decoding with a large search graph using a network connection with a server-based model.

Unfortunately, being reliant upon a remote connection makes a speech recognition system vulnerable to latency issues and/or the inherent unreliability of communication networks. To improve the usefulness of speech recognition by avoiding these issues, speech recognition systems again evolved into a form of a sequence-to-sequence model known as a recurrent neural network transducer (RNN-T). An RNN-T does not employ an attention mechanism and, unlike other sequence-to-sequence models that generally need to process an entire sequence (e.g., audio waveform) to produce an output (e.g., a sentence), the RNN-T continuously processes input samples and streams output symbols, a feature that is particularly attractive for real-time communication. For instance, speech recognition with an RNN-T may output characters one-by-one as spoken. Here, an RNN-T uses a feedback loop that feeds symbols predicted by the model back into itself to predict the next symbols. Because decoding the RNN-T includes a beam search through a single neural network instead of a large decoder graph, an RNN-T may scale to a fraction of the size of a server-based speech recognition model. With the size reduction, the RNN-T may be deployed entirely on-device and able to run offline (i.e., without a network connection), thereby avoiding the unreliability of communication networks.

In addition to operating with low latency, a speech recognition system also needs to be accurate at recognizing speech. Often for models that perform speech recognition, a metric that may define the accuracy of a model is the word error rate (WER). A WER refers to a measure of how many words are changed compared to a number of words actually spoken. Commonly, these word changes refer to substitutions (i.e., when a word gets replaced), insertions (i.e., when a word is added), and/or deletions (i.e., when a word is omitted). To illustrate, a speaker says "car," but an ASR system transcribes the word "car" as "bar." This is an example of a substitution due to phonetic similarity. When measuring the capability of an ASR system compared to other ASR systems, the WER may indicate some measure of improvement or quality capability relative to another system or some baseline.
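The WER definition above can be made concrete with a small sketch. The following Python function computes WER as word-level edit distance (substitutions, insertions, deletions) normalized by the reference length; it is an illustration of the standard metric, not code from this disclosure.

```python
# A minimal sketch of word error rate (WER) computed via word-level
# edit distance, for illustration only.

def wer(reference: str, hypothesis: str) -> float:
    """Return (substitutions + insertions + deletions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# The substitution example from the text: "car" transcribed as "bar".
print(wer("park the car", "park the bar"))  # 1 substitution / 3 words = 0.33
```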

Although an RNN-T model showed promise as a strong candidate model for on-device speech recognition, the RNN-T model alone still lags behind a large state-of-the-art conventional model (e.g., a server-based model with separate AM, PM, and LMs) in terms of quality (e.g., speech recognition accuracy). Yet a non-streaming E2E LAS model has speech recognition quality that is comparable to large state-of-the-art conventional models. To capitalize on the quality of a non-streaming E2E LAS model, a two-pass speech recognition system (e.g., shown in FIG. 2) was developed that includes a first-pass component of an RNN-T network followed by a second-pass component of a LAS network. With this design, the two-pass model benefits from the streaming nature of an RNN-T model with low latency while improving the accuracy of the RNN-T model through the second-pass incorporating the LAS network. Although the LAS network increases the latency when compared to only an RNN-T model, the increase in latency is reasonably slight and complies with latency constraints for on-device operation. With respect to accuracy, a two-pass model achieves a 17-22% WER reduction when compared to an RNN-T alone and has a similar WER when compared to a large conventional model.

Yet a two-pass model with an RNN-T network first pass and a LAS network second pass still has its tradeoffs, particularly with rare or uncommon words. These types of words may be referred to as tail utterances and are inherently more difficult for speech systems to transcribe by virtue of their ambiguity, rareness in training, or unusual verbalization. Examples of tail utterances include accented speech, cross-lingual speech, numerics, and proper nouns. For instance, proper nouns present a challenge for streaming ASR with a two-pass model because a particular name may appear only rarely, or not at all, during training yet potentially have a pronunciation that is similar to a more common word. Traditionally, a conventional model may optimize a pronunciation model (PM) to improve tail performance by injecting knowledge of the pronunciation of proper nouns. Unfortunately, a two-pass architecture lacks an explicit pronunciation model (PM) that can be specifically trained with proper noun pronunciations and a language model (LM) that can be trained on a large corpus with greater exposure to proper nouns. Without a PM as a specific site for the injection of proper noun knowledge in a streaming two-pass system, it is more difficult to model specific requirements like proper noun pronunciation. Although some models have attempted to improve on issues with uncommon/rare words by incorporating additional training data or models, these techniques increase model size, training time, and/or inference cost.

To increase a two-pass model's effectiveness on proper nouns and/or other tail utterances, the two-pass architecture uses a customized minimum word error rate (MWER) loss criteria. This loss criteria specifically seeks to emphasize proper noun recognition. By using loss criteria to improve proper noun recognition, the speech recognition system does not need new data during training or external models during inference. Here, two different methods of loss criteria may be used for proper noun recognition. The first method includes an entity tagging system that identifies proper nouns in ground truth transcripts and increases the loss for hypotheses that miss a proper noun during training. The second method injects additional hypotheses into the MWER beam where the additional hypotheses correspond to proper nouns that have been replaced by phonetically similar alternatives. For instance, an additional hypothesis of "Hallmark" is added as a phonetically similar alternative to "Walmart." In the second approach, the process of training makes the model aware of possible mistakes and potential alternatives. On a variety of proper noun test sets, these custom loss criteria methods may achieve a 2-7% relative reduction in WER when compared to a traditional two-pass architecture without custom loss criteria.

FIGS. 1A and 1B are examples of a speech environment 100. In the speech environment 100, a user's 10 manner of interacting with a computing device, such as a user device 110, may be through voice input. The user device 110 (also referred to generally as a device 110) is configured to capture sounds (e.g., streaming audio data) from one or more users 10 within the speech-enabled environment 100. Here, the streaming audio data 12 may refer to a spoken utterance by the user 10 that functions as an audible query, a command for the device 110, or an audible communication captured by the device 110. Speech-enabled systems of the device 110 may field the query or the command by answering the query and/or causing the command to be performed.

The user device 110 may correspond to any computing device associated with a user 10 and capable of receiving audio data 12. Some examples of user devices 110 include, but are not limited to, mobile devices (e.g., mobile phones, tablets, laptops, etc.), computers, wearable devices (e.g., smart watches), smart appliances, internet of things (IoT) devices, smart speakers, etc. The user device 110 includes data processing hardware 112 and memory hardware 114 in communication with the data processing hardware 112 and storing instructions, that when executed by the data processing hardware 112, cause the data processing hardware 112 to perform one or more operations. The user device 110 further includes an audio subsystem 116 with an audio capture device (e.g., microphone) 116, 116 a for capturing and converting spoken utterances 12 within the speech-enabled system 100 into electrical signals and a speech output device (e.g., a speaker) 116, 116 b for communicating an audible audio signal (e.g., as output audio data from the device 110). While the user device 110 implements a single audio capture device 116 a in the example shown, the user device 110 may implement an array of audio capture devices 116 a without departing from the scope of the present disclosure, whereby one or more capture devices 116 a in the array may not physically reside on the user device 110, but be in communication with the audio subsystem 116. The user device 110 (e.g., using the hardware 112, 114) is further configured to perform speech recognition processing on the streaming audio data 12 using a speech recognizer 200. In some examples, the audio subsystem 116 of the user device 110 that includes the audio capture device 116 a is configured to receive audio data 12 (e.g., spoken utterances) and to convert the audio data 12 into a digital format compatible with the speech recognizer 200. The digital format may correspond to acoustic frames (e.g., parameterized acoustic frames), such as mel frames. For instance, the parameterized acoustic frames correspond to log-mel filterbank energies.

In some examples, such as FIG. 1A, the user 10 interacts with a program or application 118 of the user device 110 that uses the speech recognizer 200. For instance, FIG. 1A depicts the user 10 communicating with an automated assistant application. In this example, the user 10 asks the automated assistant, "What time is the concert tonight?" This question from the user 10 is a spoken utterance 12 captured by the audio capture device 116 a and processed by audio subsystems 116 of the user device 110. In this example, the speech recognizer 200 of the user device 110 receives the audio input 202 (e.g., as acoustic frames) of "what time is the concert tonight" and transcribes the audio input 202 into a transcription 204 (e.g., a text representation of "what time is the concert tonight?"). Here, the automated assistant of the application 118 may respond to the question posed by the user 10 using natural language processing. Natural language processing generally refers to a process of interpreting written language (e.g., the transcription 204) and determining whether the written language prompts any action. In this example, the automated assistant uses natural language processing to recognize that the question from the user 10 regards the user's schedule and more particularly a concert on the user's schedule. By recognizing these details with natural language processing, the automated assistant returns a response to the user's query where the response states, "Doors open at 8:30 pm for the concert tonight." In some configurations, natural language processing may occur on a remote system in communication with the data processing hardware 112 of the user device 110.

FIG. 1B is another example of speech recognition with the speech recognizer 200. In this example, the user 10 associated with the user device 110 is communicating with a friend named Jane Doe with a communication application 118. Here, the user 10, named Ted, communicates with Jane by having the speech recognizer 200 transcribe his voice inputs. The audio capture device 116 captures these voice inputs and communicates them in a digital form (e.g., acoustic frames) to the speech recognizer 200. The speech recognizer 200 transcribes these acoustic frames into text that is sent to Jane via the communication application 118. Because this type of application 118 communicates via text, the transcription 204 from the speech recognizer 200 may be sent to Jane without further processing (e.g., natural language processing).

In some examples, such as FIG. 2, the speech recognizer 200 is configured in a two-pass speech recognition architecture (or simply "two-pass architecture"). Generally speaking, the two-pass architecture of the speech recognizer 200 includes at least one encoder 210, an RNN-T decoder 220, and a LAS decoder 230. In two-pass decoding, the second pass 208 (e.g., shown as the LAS decoder 230) may improve the initial outputs from the first pass 206 (e.g., shown as the RNN-T decoder 220) with techniques such as lattice rescoring or n-best re-ranking. In other words, the RNN-T decoder 220 produces streaming predictions (e.g., a set of N-best hypotheses) and the LAS decoder 230 finalizes the prediction (e.g., identifies the 1-best rescored hypothesis). Here, specifically, the LAS decoder 230 rescores streamed hypotheses y_(R) from the RNN-T decoder 220. Although it is generally discussed that the LAS decoder 230 functions in a rescoring mode that rescores streamed hypotheses y_(R) from the RNN-T decoder 220, the LAS decoder 230 is also capable of operating in different modes, such as a beam search mode, depending on design or other factors (e.g., utterance length).

The at least one encoder 210 is configured to receive, as an audio input 202, acoustic frames corresponding to streaming audio data 12. The acoustic frames may be previously processed by the audio subsystem 116 into parameterized acoustic frames (e.g., mel frames and/or spectral frames). In some implementations, the parameterized acoustic frames correspond to log-mel filterbank energies with log-mel features. For instance, the parameterized input acoustic frames that are output by the audio subsystem 116 and that are input into the encoder 210 may be represented as $x = (x_{1}, \ldots, x_{T})$, where $x_{t} \in \mathbb{R}^{d}$ are log-mel filterbank energies, $T$ denotes the number of frames in $x$, and $d$ represents the number of log-mel features. In some examples, each parameterized acoustic frame includes 128-dimensional log-mel features computed within a short shifting window (e.g., 32 milliseconds and shifted every 10 milliseconds). Each feature may be stacked with previous frames (e.g., three previous frames) to form a higher-dimensional vector (e.g., a 512-dimensional vector using the three previous frames). The features forming the vector may then be downsampled (e.g., to a 30 millisecond frame rate). Based on the audio input 202, the encoder 210 is configured to generate an encoding e. For example, the encoder 210 generates encoded acoustic frames (e.g., encoded mel frames or acoustic embeddings).
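The stacking and downsampling just described can be sketched in a few lines. The following is a minimal illustration, assuming a stride-of-3 downsampling to go from a 10 ms to a 30 ms frame rate and zero-padding at the start of the utterance; those details are assumptions, not specifics from the disclosure.

```python
# A sketch of the frame stacking and downsampling described above:
# 128-dimensional log-mel features are stacked with three previous frames
# to form 512-dimensional vectors, then downsampled to a 30 ms frame rate.
import numpy as np

def stack_and_downsample(logmel: np.ndarray, stack: int = 3, stride: int = 3) -> np.ndarray:
    """logmel: (T, 128) log-mel frames at a 10 ms rate -> (T', 512) at 30 ms."""
    T, d = logmel.shape
    padded = np.concatenate([np.zeros((stack, d)), logmel], axis=0)
    # Each output frame t concatenates frames t-3, t-2, t-1, t.
    stacked = np.concatenate([padded[i:i + T] for i in range(stack + 1)], axis=1)
    return stacked[::stride]  # keep every third frame (10 ms -> 30 ms)

x = np.random.randn(100, 128)          # 1 second of 10 ms log-mel frames
print(stack_and_downsample(x).shape)   # (34, 512)
```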

Although the structure of the encoder 210 may be implemented in different ways, in some implementations, the encoder 210 is a long short-term memory (LSTM) neural network. For instance, the encoder 210 includes eight LSTM layers. Here, each layer may have 2,048 hidden units followed by a 640-dimensional projection layer. In some examples, a time-reduction layer with a reduction factor N=2 is inserted after the second LSTM layer of the encoder 210.
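As a rough illustration of this configuration, the sketch below builds an eight-layer LSTM stack with 2,048 hidden units and 640-dimensional projections, with a factor-2 time reduction after the second layer implemented as frame-pair concatenation. The input dimension and the concatenation-based time reduction are assumptions for illustration, not the disclosure's exact implementation.

```python
# A sketch of the encoder described above: eight LSTM layers (2,048 hidden
# units, 640-dim projections) with a time-reduction layer (N=2) after the
# second layer. Illustrative only; input size and reduction style assumed.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, input_dim: int = 512):
        super().__init__()
        # Two layers before time reduction, six after.
        self.lower = nn.LSTM(input_dim, 2048, num_layers=2, proj_size=640,
                             batch_first=True)
        self.upper = nn.LSTM(2 * 640, 2048, num_layers=6, proj_size=640,
                             batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.lower(x)                              # (B, T, 640)
        B, T, D = h.shape
        h = h[:, : T - T % 2].reshape(B, T // 2, 2 * D)   # concat frame pairs (N=2)
        e, _ = self.upper(h)                              # (B, T/2, 640) encodings e
        return e

enc = SharedEncoder()
print(enc(torch.randn(1, 100, 512)).shape)  # torch.Size([1, 50, 640])
```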

In some configurations, the encoder 210 is a shared encoder network. In other words, instead of each pass network 206, 208 having its own separate encoder, each pass 206, 208 shares a single encoder 210. By sharing an encoder, an ASR speech recognizer 200 that uses a two-pass architecture may reduce its model size and/or its computational cost. Here, a reduction in model size may help enable the speech recognizer 200 to function well entirely on-device.

In some examples, the speech recognizer 200 of FIG. 2 also includes an additional encoder, such as an acoustic encoder 240, to adapt the encoder 210 output 212 to be suitable for the second pass 208 of the LAS decoder 230. The acoustic encoder 240 is configured to further encode the output 212 into the encoded output 252. In some implementations, the acoustic encoder 240 is an LSTM encoder (e.g., a two-layer LSTM encoder) that further encodes the output 212 from the encoder 210. By including an additional encoder, the encoder 210 may still be preserved as a shared encoder between the passes 206, 208.

During the first pass 206, the encoder 210 receives each acoustic frame of the audio input 202 and generates an output 212 (e.g., shown as the encoding e of the acoustic frame). The RNN-T decoder 220 receives the output 212 for each frame and generates an output 222, shown as the hypothesis y_(R), at each time step in a streaming fashion. In other words, the RNN-T decoder 220 may consume the frame-by-frame embeddings e, or outputs 212, and generate word piece outputs 222 as hypotheses. In some examples, the RNN-T decoder 220 generates N-best hypotheses 222 by running a beam search based on the received encoded acoustic frames 212. For the structure of the RNN-T decoder 220, the RNN-T decoder 220 may include a prediction network and a joint network. Here, the prediction network may have two LSTM layers of 2,048 hidden units and a 640-dimensional projection per layer as well as an embedding layer of 128 units. The outputs 212 of the encoder 210 and the prediction network may be fed into the joint network that includes a softmax predicting layer. In some examples, the joint network of the RNN-T decoder 220 includes 640 hidden units followed by a softmax layer that predicts 4,096 mixed-case word pieces.
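The prediction-network/joint-network split can be sketched as follows. This is an illustrative rendering under assumptions: the joint network here combines encoder and prediction outputs by concatenation, which is one common choice, and may differ from the exact combination used in the disclosure.

```python
# A sketch of the RNN-T decoder structure described above: a prediction
# network (128-unit embedding, two LSTM layers of 2,048 hidden units with
# 640-dim projections) and a joint network (640 hidden units, softmax over
# 4,096 word pieces). Concatenation in the joint network is an assumption.
import torch
import torch.nn as nn

class RNNTDecoder(nn.Module):
    def __init__(self, vocab: int = 4096):
        super().__init__()
        self.embed = nn.Embedding(vocab, 128)
        self.pred = nn.LSTM(128, 2048, num_layers=2, proj_size=640,
                            batch_first=True)
        self.joint = nn.Sequential(nn.Linear(640 + 640, 640), nn.Tanh(),
                                   nn.Linear(640, vocab))

    def forward(self, enc: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # enc: (B, T, 640) encoder output; labels: (B, U) previous word pieces.
        p, _ = self.pred(self.embed(labels))             # (B, U, 640)
        # Combine every (frame, label) pair for the joint network.
        t = enc.unsqueeze(2).expand(-1, -1, p.size(1), -1)
        u = p.unsqueeze(1).expand(-1, enc.size(1), -1, -1)
        return self.joint(torch.cat([t, u], dim=-1))     # (B, T, U, vocab) logits

dec = RNNTDecoder()
logits = dec(torch.randn(1, 50, 640), torch.zeros(1, 10, dtype=torch.long))
print(logits.shape)  # torch.Size([1, 50, 10, 4096])
```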

In the two-pass model of FIG. 2, during the second pass 208, the LAS decoder 230 receives the output 212 from the encoder 210 for each frame and generates an output 232 designated as the hypothesis y_(L). When the LAS decoder 230 operates in a beam search mode, the LAS decoder 230 produces the output 232 from the output 212 alone, ignoring the output 222 of the RNN-T decoder 220. When the LAS decoder 230 operates in the rescoring mode, the LAS decoder 230 obtains the top-K hypotheses 222, y_(R), from the RNN-T decoder 220 (e.g., corresponding to the N-best hypotheses generated by the RNN-T decoder 220) and then the LAS decoder 230 is run on each sequence in a teacher-forcing mode, with attention on the output 212, to compute a score. For example, a score combines a log probability of the sequence and an attention coverage penalty. The LAS decoder 230 selects a sequence with the highest score to be the output 232. In other words, the LAS decoder 230 may choose a single hypothesis y_(R) with a maximum likelihood from the N-best list of hypotheses 222 from the RNN-T decoder 220. Here, in the rescoring mode, the LAS decoder 230 may include multi-headed attention (e.g., with four heads) to attend to the output 212. Furthermore, the LAS decoder 230 may be a two-layer LAS decoder 230 with a softmax layer for prediction. For instance, each layer of the LAS decoder 230 has 2,048 hidden units followed by a 640-dimensional projection. The softmax layer may include 4,096 dimensions to predict the same mixed-case word pieces as the softmax layer of the RNN-T decoder 220.
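The rescoring logic itself reduces to a scoring and selection step, sketched below. Here `las_log_prob` and `coverage_penalty` are assumed helper callables standing in for the teacher-forced LAS scoring and the attention coverage penalty; the weighting factor `alpha` is likewise an assumption.

```python
# A sketch of second-pass rescoring as described above: run the LAS decoder
# in teacher-forcing mode on each first-pass hypothesis, score each by log
# probability plus an attention coverage penalty, and keep the best.
from typing import Callable, List

def rescore(hypotheses: List[List[int]],
            las_log_prob: Callable[[List[int]], float],
            coverage_penalty: Callable[[List[int]], float],
            alpha: float = 1.0) -> List[int]:
    """Return the 1-best hypothesis among the first pass's top-K outputs."""
    def score(y: List[int]) -> float:
        # Log probability of the sequence plus a weighted coverage penalty.
        return las_log_prob(y) + alpha * coverage_penalty(y)
    return max(hypotheses, key=score)
```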

A neural network is generally trained by backpropagation using a defined loss function (e.g., a cross-entropy loss function). For instance, the loss function is defined as a difference between the actual outputs of the network and the desired outputs of the network. To train a model with a cross-entropy (CE) loss function, the model trains to optimize the CE loss function by maximizing the log-likelihood of the training data. Referring to FIGS. 3A-C, a training procedure 300 may train each component of the speech recognizer 200 on a corresponding set of training data 302, 302 a-d (FIGS. 3A-3C). For example, the training procedure 300 for training the two-pass model architecture of the speech recognizer 200 of FIG. 2 may occur in three stages 310, 320, 330. During the first stage 310, the training procedure 300 trains the encoder 210 and the RNN-T decoder 220 (e.g., using a CE loss function). In some examples, the training procedure 300 trains the encoder 210 and the RNN-T decoder 220 to maximize P(y_(R)=y|x). During the second stage 320, the training procedure 300 trains the LAS decoder 230 without updating parameters of the encoder 210 or the RNN-T decoder 220. In some implementations, the training procedure 300 trains the LAS decoder 230 using a cross-entropy, teacher-forcing loss. For instance, the training procedure 300 trains the LAS decoder 230 to maximize P(y_(L)=y|x). During the third stage 330, the training procedure 300 further trains the LAS decoder 230 with a minimum WER (MWER) loss to optimize the expected word error rate by using n-best hypotheses. For example, the MWER objective function models the loss as a weighted average of word errors in an N-best beam of hypotheses 222. During this third stage 330, the LAS decoder 230 may be fine-tuned according to the MWER objective function represented by the following equation:

$L_{MWER}(x, y^{*}) = \sum_{y \in B_{LAS}} P(y \mid x)\, \hat{W}(y \mid y^{*}) \qquad (1)$

where y* is the ground truth, B_(LAS) is an N-best list of hypotheses from the LAS decoder 230 during a beam search, P(y|x) is the normalized posterior for the hypothesis y, and Ŵ(y|y*) represents the difference between the number of word errors in the hypothesis y and the average number of word errors across the beam. In some implementations, when the LAS decoder 230 functions as a rescorer, the LAS decoder 230 trains to optimize assigning a high likelihood to the best hypothesis y_(R) from the RNN-T decoder 220. Here, this loss optimization function may be represented by the following equation:

$L_{MWER}(x, y^{*}) = \sum_{y \in B_{RNN\text{-}T}} P(y \mid x)\, \hat{W}(y \mid y^{*}) \qquad (2)$

where B_(RNN-T) is obtained from the beam search on the RNN-T decoder 220. Here, each of these optimization models indicates that the loss criteria represents a distribution over which the speech recognizer 200, or part thereof, should learn to assign probability mass.
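Equations (1) and (2) share the same form, differing only in which beam the sum runs over, so one sketch covers both. The following illustration renormalizes hypothesis log probabilities over the beam to obtain P(y|x) and weights each hypothesis's word errors relative to the beam average, per the definition of Ŵ above; the input representation as plain lists is an assumption.

```python
# A sketch of the MWER loss in equations (1) and (2): a posterior-weighted
# average of each hypothesis's word errors relative to the beam average.
import math
from typing import List

def mwer_loss(log_probs: List[float], word_errors: List[int]) -> float:
    """log_probs and word_errors are aligned over the N hypotheses in the beam."""
    # P(y|x): posteriors renormalized over the beam.
    z = sum(math.exp(lp) for lp in log_probs)
    p = [math.exp(lp) / z for lp in log_probs]
    # W_hat(y|y*): word errors relative to the beam average.
    avg = sum(word_errors) / len(word_errors)
    return sum(pi * (w - avg) for pi, w in zip(p, word_errors))

print(mwer_loss([-1.0, -2.0, -3.0], [0, 2, 3]))  # negative: mass sits on the best hypothesis
```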

Referring to FIG. 3B, in some implementations, during the third training stage 330 or fine-tuning stage, the training procedure 300 performs training using MWER loss, but with a modified loss function MWER_(AUG). Here, the modified loss function MWER_(AUG) is a form of proper noun loss augmentation. In this training approach, the loss is configured to emphasize proper noun performance in training. In some examples, the loss emphasizes proper noun performance by increasing a penalty 332 applied to the model (e.g., the LAS decoder 230) when the model assigns a high probability to a hypothesis y that fails to correctly include a proper noun. To illustrate, FIG. 3B depicts that during the third stage 330 of the training procedure 300, the LAS decoder 230 generates a set of likely hypotheses y_(L), y_(L1-3) that predict the input 302 d. Here, the input 302 d includes a proper noun Pn, but the LAS decoder 230 identifies a first hypothesis y_(L1) as being the highest probability hypothesis y_(L) for the input 302 d even though it does not actually include the proper noun Pn. In this example, the modified loss function MWER_(AUG) applies the penalty 332 because the LAS decoder 230 assigned the highest probability to a hypothesis y_(L) that incorrectly represents the proper noun Pn. In some configurations, the training procedure 300 determines that the model (e.g., the LAS decoder 230) has assigned a probability to a hypothesis y that satisfies penalty criteria. The penalty criteria may include that the model assigned a probability to an incorrect hypothesis for the proper noun that satisfies a probability threshold (e.g., exceeds a value assigned to the probability threshold). Here, the probability threshold may be a preconfigured value that indicates an acceptable level or value for an incorrect hypothesis. In these examples, when the training procedure 300 determines that the model (e.g., the LAS decoder 230) has assigned a probability to a hypothesis y that satisfies the penalty criteria, the training procedure 300 applies a penalty 332 to the modified loss function. In some examples, the modified loss function for proper noun loss augmentation is represented by the following equations:

$L_{AUG}(x, y^{*}) = \sum_{y \in B_{RNN\text{-}T}} P(y \mid x)\, \hat{W}(y \mid y^{*}) \cdot C_{\lambda}(y \mid y^{*}) \qquad (3)$

where

$C_{\lambda}(y, y^{*}) = \begin{cases} \lambda & \text{if } y^{*} \text{ includes a proper noun that is not in } y \\ 1 & \text{otherwise} \end{cases} \qquad (4)$

for some constant λ>1. Here, λ refers to a hyperparameter selected to balance the effectiveness of proper noun recognition with respect to the performance of the speech recognizer 200 for general utterances 12. For example, configuration of the hyperparameter λ tries to avoid increasing the gradient originating from proper noun errors at the tradeoff of other error types. In some configurations, proper nouns Pn for each ground-truth transcription (e.g., training data 302 d) are identified prior to training by a proper noun identification system 340. To ensure that a hypothesis y includes a proper noun Pn, a hypothesis y is defined as including the proper noun Pn when the hypothesis y contains the entire word sequence of the proper noun Pn in the appropriate order. For example, the proper noun Pn "Cedar Rapids" is contained in the hypothesis "Population of Cedar Rapids," but not in the hypothesis "Cedar tree height" or "Cedar Rapidsss."
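The containment rule and the C_λ scaling of equation (4) can be sketched directly. In the illustration below, the value 2.0 for λ is only a placeholder (the disclosure requires only λ>1), and `c_lambda` is a hypothetical helper name.

```python
# A sketch of the containment rule described above and the C_lambda scaling
# of equation (4). The lambda value of 2.0 is illustrative only.
from typing import List

def contains_proper_noun(hypothesis: str, proper_noun: str) -> bool:
    """True only if the full word sequence of the proper noun appears in order."""
    hyp, pn = hypothesis.split(), proper_noun.split()
    return any(hyp[i:i + len(pn)] == pn for i in range(len(hyp) - len(pn) + 1))

def c_lambda(hypothesis: str, truth_proper_nouns: List[str], lam: float = 2.0) -> float:
    """Equation (4): scale by lambda > 1 if a ground-truth proper noun is missed."""
    missed = any(not contains_proper_noun(hypothesis, pn)
                 for pn in truth_proper_nouns)
    return lam if missed else 1.0

# The examples from the text:
print(contains_proper_noun("Population of Cedar Rapids", "Cedar Rapids"))  # True
print(contains_proper_noun("Cedar tree height", "Cedar Rapids"))           # False
print(contains_proper_noun("Cedar Rapidsss", "Cedar Rapids"))              # False
```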

FIG. 3C illustrates another example of the training procedure 300 that applies fuzz training to optimize the ability of the speech recognizer 200 to distinguish proper nouns. In this approach, the fuzz training aims to teach the speech recognizer 200 how to distinguish between proper nouns and phonetically similar, incorrect alternatives. In other words, fuzz training a model, such as the speech recognizer 200, permits the model to gain knowledge of possible mistakes and alternative spellings. During training, when the model (e.g., the LAS decoder 230) assigns a high likelihood to a proper noun mistake, the training procedure 300 imposes a penalty 332 on the model. By imposing the penalty 332, the training intends to decrease the likelihood of a similar error in the future.

To train the speech recognizer 200 (e.g., the LAS decoder 230 of the speech recognizer 200) on these potential mistakes, the fuzz training may perform beam modification. Generally speaking, a beam search includes a beam size or beam width parameter B that specifies how many of the best potential solutions (e.g., hypotheses or candidates) to evaluate. The fuzz training may leverage the beam size B by either replacing hypotheses y from the beam search or expanding on a number of hypotheses y from the beam search. To illustrate, FIG. 3C is an example that depicts a beam search with a beam size of five corresponding to the five hypotheses y_(L), y_(L1-5), or a beam size of three that has been expanded to the five hypotheses y_(L), y_(L1-5). In this example, when the beam size is five, a fuzzing system 350 may replace two of the hypotheses with incorrect proper noun alternatives 352, 352 a-b. Similarly, when the beam size is three, the fuzzing system 350 may generate additional hypotheses y with incorrect proper noun alternatives 352, 352 a-b. In some implementations, the fuzzing system 350 generates the proper noun alternatives 352 using a technique called phonetic fuzzing that generates alternatives 352 that are phonetically similar to the proper noun Pn contained within the training data 302. With phonetic fuzzing, the fuzzing system 350 may even generate new words or alternative spellings that a more traditional corpus of training data 302 may not have emphasized or included. For a hypothesis y ∈ B_(RNN-T) corresponding to the ground truth y*, the fuzz operation may be represented by the following equation:

$Fuzz(y, y^{*}) = \begin{cases} y^{fuzz} & \text{if } y^{*} \text{ and } y \text{ share a proper noun} \\ y & \text{otherwise} \end{cases} \qquad (5)$

In some configurations, the fuzz hypothesis y^(fuzz) is formed by copying y and then replacing an occurrence of the proper noun Pn with a phonetically similar alternative 352. In fuzz training, the loss function is defined by combining the original beam from the RNN-T decoder 220 with an alternative 352 (also referred to as a fuzz or fuzz hypothesis). The following equation may represent the loss function during the training procedure 300 with fuzz training:

$L_{Fuzz}(x, y^{*}) = \sum_{y \in B_{RNN\text{-}T} \,\cup\, Fuzz(B_{RNN\text{-}T})} P(y \mid x)\, \hat{W}(y \mid y^{*}) \qquad (6)$

where P(y|x) corresponds to the renormalized posterior that accounts for the modified beam size (e.g., as represented by the additional term Fuzz(B_(RNN-T))). In some implementations, the loss function of the fuzz training also includes a hyperparameter τ, where 0≤τ≤1, such that the hyperparameter defines the probability of using the fuzz training loss function L_(Fuzz). In these implementations, when the training procedure 300 does not use the fuzz training loss function L_(Fuzz), the training procedure 300 uses the loss function as represented by equation (2). Although the hyperparameter τ may be set to any probability, in some configurations, the hyperparameter is set to 1 such that the training procedure 300 always incorporates the fuzz training loss function L_(Fuzz).

In some configurations, the training procedure 300 determines a set number of alternatives 352 (e.g., twenty-five alternatives 352) for each proper noun Pn included in the training data set 302 prior to fuzz training. Here, the number of alternatives 352 generated before fuzz training may be configured to ensure the diversity of alternatives 352 while minimizing computational expense. When the training procedure 300 generates the set number of alternatives 352 before fuzz training, during fuzz training the training procedure 300 may then select random alternatives 352 that have already been generated, as needed.
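Putting the pieces together, the fuzz workflow amounts to pre-generating alternatives per proper noun and then, during training, applying equation (5) with a random pre-generated alternative. In the sketch below, `phonetic_alternatives` is an assumed helper standing in for the fuzzing system 350, the substring membership test is a simplification (a word-sequence check like `contains_proper_noun` above could be substituted), and the τ gate decides whether equation (6) or equation (2) is used for a given step.

```python
# A sketch of the fuzz training flow described above: pre-generate a set
# number of alternatives per proper noun, then swap a random one into a
# beam hypothesis (equation (5)); with probability tau the fuzz loss of
# equation (6) is used, otherwise the loss of equation (2).
import random
from typing import Callable, Dict, List

def build_alternatives(proper_nouns: List[str],
                       phonetic_alternatives: Callable[[str, int], List[str]],
                       n: int = 25) -> Dict[str, List[str]]:
    """Pre-generate a set number of alternatives (e.g., 25) per proper noun."""
    return {pn: phonetic_alternatives(pn, n) for pn in proper_nouns}

def fuzz(hypothesis: str, proper_noun: str,
         alternatives: Dict[str, List[str]]) -> str:
    """Equation (5): replace a shared proper noun with a random alternative."""
    if proper_noun in hypothesis:  # y* and y share a proper noun (simplified check)
        return hypothesis.replace(proper_noun,
                                  random.choice(alternatives[proper_noun]))
    return hypothesis  # otherwise keep y unchanged

def use_fuzz_loss(tau: float = 1.0) -> bool:
    """Hyperparameter tau in [0, 1]: probability of using L_Fuzz this step."""
    return random.random() < tau

# E.g., "Walmart" in a hypothesis might become the phonetically similar
# "Hallmark", making the model aware of the mistake it should avoid.
```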

With continued reference to FIG. 3C, during the third stage 330, the training procedure 300 uses fuzz training to train the LAS decoder 230. Here, the LAS decoder 230 receives training data 302, 302 d that includes a proper noun Pn and generates five hypotheses y_(L), y_(L1-5) corresponding to the proper noun Pn of the training data 302 d (e.g., a beam width B=5). The LAS decoder 230 also assigns each hypothesis y_(L) a probability (e.g., shown as 0.2, 0.2, 0.1, 0.4, and 0.1) that indicates a likelihood at which the LAS decoder 230 thinks that particular hypothesis correctly identifies the input (e.g., the training data 302). In this example, the fuzzing system 350 generates or selects (e.g., if alternatives 352 are generated prior to fuzz training) two fuzz hypotheses 352 a-b to include in the set of potential hypotheses y_(L): "Belmont" and "Boomundt." As illustrated in the example, the LAS decoder 230 assigns the highest likelihood (e.g., shown as 0.4) to the incorrect alternative "Belmont" 352 a. Because the LAS decoder 230 assigns the highest likelihood to an incorrect alternative 352, the training procedure 300 applies a penalty 332 to the fuzz training loss function L_(Fuzz). Here, a penalty, such as the penalty 332, provides feedback during training to adjust weights or parameters of a neural network. Generally speaking, a penalty functions to steer the weights applied to a particular input to approach or indicate the intended output rather than an unwanted or inaccurate output. In other words, the penalty 332 functions to reduce the likelihood that the LAS decoder 230 would in the future indicate that an incorrect alternative 352 is likely the best hypothesis y.

FIG. 4 is a flowchart of an example arrangement of operations for a method 400 of training a speech recognition model (e.g., the speech recognizer 200). The method 400 trains a speech recognition model with a minimum word error rate (MWER) loss function by operations 402-408. At operation 402, the method 400 receives a training example 302 that includes a proper noun Pn. At operation 404, the method 400 generates a plurality of hypotheses y corresponding to the training example 302. Here, each hypothesis y of the plurality of hypotheses represents the proper noun Pn, and each hypothesis is assigned a probability that indicates a likelihood for a respective hypothesis y. At operation 406, the method 400 determines that a probability associated with a hypothesis y satisfies a penalty criteria. The penalty criteria indicates that (i) the probability satisfies a probability threshold and (ii) the hypothesis incorrectly represents the proper noun. At operation 408, the method 400 applies a penalty 332 to the minimum word error rate loss function.

FIG. 5 is a schematic view of an example computing device 500 that may be used to implement the systems (e.g., the speech recognizer 200) and methods (e.g., the method 400) described in this document. The computing device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.

The computing device 500 includes a processor 510 (e.g., data processing hardware), memory 520 (e.g., memory hardware), a storage device 530, a high-speed interface/controller 540 connecting to the memory 520 and high-speed expansion ports 550, and a low-speed interface/controller 560 connecting to a low-speed bus 570 and a storage device 530. Each of the components 510, 520, 530, 540, 550, and 560 are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 510 can process instructions for execution within the computing device 500, including instructions stored in the memory 520 or on the storage device 530 to display graphical information for a graphical user interface (GUI) on an external input/output device, such as a display 580 coupled to the high-speed interface 540. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 500 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

The memory 520 stores information non-transitorily within the computing device 500. The memory 520 may be a computer-readable medium, a volatile memory unit(s), or a non-volatile memory unit(s). The non-transitory memory 520 may be physical devices used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by the computing device 500. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), and phase change memory (PCM), as well as disks or tapes.

The storage device 530 is capable of providing mass storage for the computing device 500. In some implementations, the storage device 530 is a computer-readable medium. In various different implementations, the storage device 530 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. In additional implementations, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 520, the storage device 530, or memory on the processor 510.

The high-speed controller 540 manages bandwidth-intensive operations for the computing device 500, while the low-speed controller 560 manages lower bandwidth-intensive operations. Such allocation of duties is exemplary only. In some implementations, the high-speed controller 540 is coupled to the memory 520, the display 580 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 550, which may accept various expansion cards (not shown). In some implementations, the low-speed controller 560 is coupled to the storage device 530 and a low-speed expansion port 590. The low-speed expansion port 590, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.

The computing device 500 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 500 a or multiple times in a group of such servers 500 a, as a laptop computer 500 b, or as part of a rack server system 500 c.

Various implementations of the systems and techniques described herein can be realized in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, non-transitory computer readable medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), LCD (liquid crystal display) monitor, or touch screen for displaying information to the user and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.

What is claimed is:
1. A computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving a training example comprising a ground-truth transcription, the ground-truth transcription comprising a word sequence that includes a proper noun; and training a speech recognition model with a fuzz training loss function by: generating an original beam of hypotheses corresponding to the training example, each hypothesis in the original beam of hypotheses comprising a respective sequence of words and a corresponding probability that indicates a likelihood that the hypothesis correctly identifies the ground-truth transcription; generating a fuzz hypothesis corresponding to the training example that comprises a respective sequence of words that includes an incorrect proper noun alternative word and a corresponding probability that indicates a likelihood that the fuzz hypothesis correctly identifies the ground-truth transcription; determining that the corresponding probability associated with the fuzz hypothesis is greater than the corresponding probability associated with each hypothesis in the original beam of hypotheses; and based on determining that the corresponding probability associated with the fuzz hypothesis is greater than the corresponding probability associated with each hypothesis in the original beam of hypotheses, applying a penalty to the fuzz training loss function.
2. The computer-implemented method of claim 1, wherein the incorrect proper noun alternative word included in the respective sequence of words of the fuzz hypothesis comprises a phonetic similarity to the proper noun included in the word sequence of the ground-truth transcription.
3. The computer-implemented method of claim 1, wherein the incorrect proper noun alternative word included in the respective sequence of words of the fuzz hypothesis comprises an alternative spelling to a spelling of the proper noun included in the word sequence of the ground-truth transcription.
4. The computer-implemented method of claim 1, wherein the operations further comprise, prior to training the speech recognition model with the fuzz training loss function: identifying the proper noun in the word sequence of the ground-truth transcription; and generating a set number of different alternative words for the identified proper noun, wherein generating the fuzz hypothesis during training the speech recognition model comprises randomly selecting one of the different alternative words for the identified proper noun as the incorrect proper noun alternative word included in the respective sequence of words of the fuzz hypothesis.
5. The computer-implemented method of claim 1, wherein generating the fuzz hypothesis comprises: identifying an occurrence of the proper noun in the respective sequence of words of one of the hypotheses from the original beam of hypotheses; and replacing the occurrence of the proper noun in the respective sequence of words of the one of the hypotheses from the original beam of hypotheses with the incorrect proper noun alternative word.
6. The computer-implemented method of claim 5, wherein the operations further comprise combining the original beam of hypotheses with the fuzz hypothesis.
7. The computer-implemented method of claim 5, wherein the operations further comprise substituting the fuzz hypothesis for the one of the hypotheses from the original beam of hypotheses that comprises the respective sequence of words including the occurrence of the proper noun.
8. The computer-implemented method of claim 1, wherein the speech recognition model comprises a two-pass architecture comprising: a first pass network comprising a recurrent neural network transducer (RNN-T) decoder; and a second pass network comprising a listen-attend-spell (LAS) decoder.
9. The computer-implemented method of claim 8, wherein the speech recognition model further comprises a shared encoder, the shared encoder encoding acoustic frames for each of the first pass network and the second pass network.
10. The computer-implemented method of claim 8, wherein the operations further comprise: training the RNN-T decoder; and prior to training the LAS decoder with the fuzz training loss function, training the LAS decoder while parameters of the trained RNN-T decoder remain fixed.
11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: receiving a training example comprising a ground-truth transcription, the ground-truth transcription comprising a word sequence that includes a proper noun; and training a speech recognition model with a fuzz training loss function by: generating an original beam of hypotheses corresponding to the training example, each hypothesis in the original beam of hypotheses comprising a respective sequence of words and a corresponding probability that indicates a likelihood that the hypothesis correctly identifies the ground-truth transcription; generating a fuzz hypothesis corresponding to the training example that comprises a respective sequence of words that includes an incorrect proper noun alternative word and a corresponding probability that indicates a likelihood that the fuzz hypothesis correctly identifies the ground-truth transcription; determining that the corresponding probability associated with the fuzz hypothesis is greater than the corresponding probability associated with each hypothesis in the original beam of hypotheses; and based on determining that the corresponding probability associated with the fuzz hypothesis is greater than the corresponding probability associated with each hypothesis in the original beam of hypotheses, applying a penalty to the fuzz training loss function.
12. The system of claim 11, wherein the incorrect proper noun alternative word included in the respective sequence of words of the fuzz hypothesis comprises a phonetic similarity to the proper noun included in the word sequence of the ground-truth transcription.
13. The system of claim 11, wherein the incorrect proper noun alternative word included in the respective sequence of words of the fuzz hypothesis comprises an alternative spelling to a spelling of the proper noun included in the word sequence of the ground-truth transcription.
14. The system of claim 11, wherein the operations further comprise, prior to training the speech recognition model with the fuzz training loss function: identifying the proper noun in the word sequence of the ground-truth transcription; and generating a set number of different alternative words for the identified proper noun, wherein generating the fuzz hypothesis during training the speech recognition model comprises randomly selecting one of the different alternative words for the identified proper noun as the incorrect proper noun alternative word included in the respective sequence of words of the fuzz hypothesis.
15. The system of claim 11, wherein generating the fuzz hypothesis comprises: identifying an occurrence of the proper noun in the respective sequence of words of one of the hypotheses from the original beam of hypotheses; and replacing the occurrence of the proper noun in the respective sequence of words of the one of the hypotheses from the original beam of hypotheses with the incorrect proper noun alternative word.
16. The system of claim 15, wherein the operations further comprise combining the original beam of hypotheses with the fuzz hypothesis.
17. The system of claim 15, wherein the operations further comprise substituting the fuzz hypothesis for the one of the hypotheses from the original beam of hypotheses that comprises the respective sequence of words including the occurrence of the proper noun.
18. The system of claim 11, wherein the speech recognition model comprises a two-pass architecture comprising: a first pass network comprising a recurrent neural network transducer (RNN-T) decoder; and a second pass network comprising a listen-attend-spell (LAS) decoder.
19. The system of claim 18, wherein the speech recognition model further comprises a shared encoder, the shared encoder encoding acoustic frames for each of the first pass network and the second pass network.
20. The system of claim 18, wherein the operations further comprise: training the RNN-T decoder; and prior to training the LAS decoder with the fuzz training loss function, training the LAS decoder while parameters of the trained RNN-T decoder remain fixed.